13  Interpretable Models

13.1 Planning Database

  • Motivation: The decennial census is used to allocate hundreds of billions of dollars and to redistrict political power. Ensuring an accurate count is imperative. The Census Bureau uses the planning database (PDB) and a modeled low-response score (LRS) to plan outreach to improve response rates.
  • Implementation data: Demographic and housing information from the American Community Survey at the census tract and census block group levels.
  • Modeling data: Demographic and housing information from the American Community Survey and the low-response score for the 2010 Decennial Census at the census tract and census block group levels.
  • Objective: Predict which areas will have the lowest response rates to the 2020 Decennial Census.
  • Tools: Linear regression based on the top 25 predictors from gradient-boosted trees.
  • Results: Unclear but we’ll see more in these notes!

13.1.1 LRS Set Up

We will work with the tract-level PDB instead of the block-group level PDB to simplify computation. The PDB contains more than 500 variables. For now, we only consider the top 25 variables to further simplify computation.

pdb_small <- read_csv(here::here("data", "pdb_small.csv"))
Rows: 69650 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): State_name, County_name
dbl (24): Low_Response_Score, non_return_rate, Renter_Occp_HU_ACS_13_17, Pop...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(20231114)
pdb_split <- initial_split(pdb_small, prop = 0.8)

pdb_train <- training(pdb_split)

Lastly, we set up \(v\)-fold cross validation. We only use five folds because some of the models take a long time to fit to the data.

pdb_folds <- vfold_cv(data = pdb_train, v = 5)

13.2 Motivation

The models that won the planning database Kaggle competition all used ensembled trees. They had high predictive power but were difficult to interpret and difficult to use for inference.

In the end, the Census Bureau used the ensembled trees to inform a linear regression model that was easier to interpret. In other words, the Census Bureau sacrificed predictive power to gain interpretability.

This section will introduce extensions of linear regression models that aim to improve predictive power without quickly becoming difficult to interpret or “black box.”

Like in earlier sections, we will use the simulated example 1 data to demonstrate some model fits.

Code
library(tidymodels)

set.seed(20201004)

x <- runif(n = 1000, min = 0, max = 10)

data1 <- bind_cols(
  x = x,
  y = 10 * sin(x) + x + 20 + rnorm(n = length(x), mean = 0, sd = 2)
)

set.seed(20201007)

# create a split object
data1_split <- initial_split(data = data1, prop = 0.75)

# create the training and testing data
data1_train <- training(x = data1_split)
data1_test  <- testing(x = data1_split)

13.3 Polynomial Regression

Polynomial Regression

Polynomial regression includes higher-degree predictors in the design matrix. For example, a regression model will include \(x\) and \(x^2\) as predictors.

Polynomials extend linear regression to include transformations of predictors such that

\[ y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + \cdots + \beta_dx_i^d + \epsilon_i \]

where \(d\) is referred to as the degree of the polynomial.

Polynomial regression is still linear regression because the model is linear in its coefficients, but it can fit non-linear patterns in the data.
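The linearity-in-coefficients point is easy to check in base R: poly(x, raw = TRUE) simply appends power columns to the design matrix, and lm() solves the usual least-squares problem. A minimal sketch on freshly simulated data (not the book's data1 object):

```r
set.seed(8)
x <- runif(50, 0, 10)
y <- 2 + x - 0.5 * x^2 + rnorm(50, sd = 0.5)

# raw = TRUE keeps the columns as the literal powers x and x^2
X <- poly(x, degree = 2, raw = TRUE)

# the model is linear in its coefficients, so lm() estimates it directly
fit <- lm(y ~ X)
coef(fit)  # estimates near 2, 1, and -0.5
```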

Figure 13.1 shows four different linear regression models fit to the training data. The degree is the order of the highest-order term included in the model.

Code
lin_reg_model1 <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ x, data = data1_train)

lin_reg_model2 <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ poly(x, degree = 2, raw = TRUE), data = data1_train)

lin_reg_model3 <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ poly(x, degree = 3, raw = TRUE), data = data1_train)

lin_reg_model4 <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ poly(x, degree = 4, raw = TRUE), data = data1_train)

# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))

predictions_grid <- tibble(
  x = seq(0, 10, 0.1),
  `Degree = 1` = predict(object = lin_reg_model1, new_data = new_data)$.pred,
  `Degree = 2` = predict(object = lin_reg_model2, new_data = new_data)$.pred,
  `Degree = 3` = predict(object = lin_reg_model3, new_data = new_data)$.pred,
  `Degree = 4` = predict(object = lin_reg_model4, new_data = new_data)$.pred
) |>
  pivot_longer(-x, names_to = "model", values_to = ".pred")

ggplot() +
  geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
  geom_path(data = predictions_grid, aes(x = x, y = .pred), color = "red") +
  facet_wrap(~model) +
  labs(
    title = "Example 1: Data with Predictions (Linear Regression)",
    subtitle = "Prediction in red"
  ) +
  theme_minimal()

Figure 13.1: Linear regression models of degree 1 through 4 fit to the example 1 training data.

13.4 Interactions

Interactions

Interactions include the product of predictors in the design matrix for linear regression. For example, a regression model will include \(x_1\), \(x_2\), and \(x_1x_2\) as predictors.

Interactions extend polynomials to consider multiplying predictors by other predictors in addition to multiplying them by themselves. Including pairwise or triplewise interactions complicates interpretation but can improve the fit of the model to the data.
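As a minimal base R sketch (on two simulated predictors of our own, not the PDB data), the `*` formula operator expands to the main effects plus their product:

```r
set.seed(1)
x1 <- runif(200)
x2 <- runif(200)
y  <- 1 + 2 * x1 - x2 + 4 * x1 * x2 + rnorm(200, sd = 0.1)

# y ~ x1 * x2 expands to x1 + x2 + x1:x2, so the main effects
# are estimated alongside the interaction term
fit <- lm(y ~ x1 * x2)
coef(fit)["x1:x2"]  # estimate near the true interaction effect of 4
```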

Hierarchical Principle

The hierarchical principle states that main effects should be included in regression models when interactions are included even if the main effects are not statistically significant. This includes polynomial regression.

Outside of applications where a small number of interactions is included because of theory, interactions can explode the number of parameters that require estimation.

“A key problem in the estimation of two-way interaction models is the explosion in the number of unknown parameters.” (Ibrahim et al. 2021)

Simple pairwise interactions for the small planning database example explode the number of predictors from 23 to 276.

recipe(formula =  ~ ., data = pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
  step_interact(terms = ~ all_predictors():all_predictors()) |>
  prep(training = pdb_train) |>
  bake(new_data = NULL)
# A tibble: 55,720 × 279
   State_name     County_name           Low_Response_Score non_return_rate
   <fct>          <fct>                              <dbl>           <dbl>
 1 Arizona        Yavapai County                      15.2            16.5
 2 Georgia        Screven County                      23.5            23  
 3 Iowa           Polk County                         21.4            24.4
 4 Wyoming        Sweetwater County                   26              14.8
 5 Missouri       Jasper County                       22.8            17.6
 6 Virginia       Hanover County                      12.1            10.3
 7 Oregon         Clackamas County                    19.5            21.8
 8 North Carolina Chatham County                      20.1            14.7
 9 North Carolina Mecklenburg County                  17.6            15.3
10 California     San Bernardino County               28              24.9
# ℹ 55,710 more rows
# ℹ 275 more variables: Renter_Occp_HU_ACS_13_17 <dbl>,
#   Pop_18_24_ACS_13_17 <dbl>, Female_No_HB_ACS_13_17 <dbl>,
#   NH_White_alone_ACS_13_17 <dbl>, Pop_65plus_ACS_13_17 <dbl>,
#   Rel_Child_Under_6_ACS_13_17 <dbl>, Males_ACS_13_17 <dbl>,
#   MrdCple_Fmly_HHD_ACS_13_17 <dbl>, Pop_25_44_ACS_13_17 <dbl>,
#   Tot_Vacant_Units_ACS_13_17 <dbl>, College_ACS_13_17 <dbl>, …

Regularization with methods like LASSO regression and cross validation can help in the presence of many predictors.

13.5 Step Functions

Step functions are not often used directly in predictive modeling but they are an important building block for other methods.

Step functions break the range of a predictor into bins and then fit a different constant (intercept-only model) in each bin. Figure 13.2 shows piecewise constant regression fit to the simulated example 1 data. The example uses 9 separate bins (8 knots).

Code
# create a simple linear regression model
lin_reg_model <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") 

# specify a piecewise constant model
step_function_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_spline_b(all_predictors(), degree = 0, deg_free = 8)

# fit the model
step_function_mod <- workflow() |>
  add_model(spec = lin_reg_model) |>
  add_recipe(recipe = step_function_rec) |>
  fit(data = data1_train) 

# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))

predictions_grid <- tibble(
  x = seq(0, 10, 0.1),
  `Piecewise Constant Regression` = predict(object = step_function_mod, new_data = new_data)$.pred
) |>
  pivot_longer(-x, names_to = "model", values_to = ".pred")

ggplot() +
  geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
  geom_line(
    data = predictions_grid,
    mapping = aes(x = x, y = .pred), 
    color = "red"
  ) +
  facet_wrap(~model) +
  labs(
    title = "Example 1: Data with Predictions (Piecewise Constant Regression)",
    subtitle = "Prediction in red"
  ) +
  theme_minimal()

Figure 13.2: Piecewise constant regression with 8 knots fit to the example 1 training data.

We can now see the connection between piecewise constant regression and regression trees. Regression trees pick the knots, whereas we must specify the knots ourselves for piecewise constant regression.
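The same piecewise constant idea can be sketched in base R: bin \(x\) with cut() at hand-picked knots and fit one constant per bin (on freshly simulated data of our own, not the book's data1 object):

```r
set.seed(7)
x <- runif(500, 0, 10)
y <- 10 * sin(x) + x + 20 + rnorm(500, sd = 2)

# 8 interior knots -> 9 bins; -1 drops the global intercept so lm()
# fits a separate constant for each bin
bins <- cut(x, breaks = seq(0, 10, length.out = 10), include.lowest = TRUE)
fit  <- lm(y ~ bins - 1)
length(coef(fit))  # 9 fitted constants, one per bin
```

Each fitted constant is just the mean of \(y\) within its bin, which is exactly what a regression tree leaf predicts.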

13.6 Basis Functions

Basis Functions

Basis functions are fixed and known transformations applied to a variable \(x\) before modeling. We can denote the basis functions as \(b_1(), b_2(), ..., b_K()\). Basis functions result in “derived features.”

Consider a few simple basis functions:

  • Linear: \(b_j(x_i) = x_i\)
  • Polynomial: \(b_j(x_i) = x_i^j\)
  • Piecewise constant: \(b_j(x_i) = I(c_j \leq x_i < c_{j + 1})\)
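These basis functions can be built by hand as derived-feature columns; a base R sketch on toy data of our own (not the book's example data):

```r
set.seed(2)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.2)

# polynomial basis: b_j(x) = x^j
B_poly <- cbind(b1 = x, b2 = x^2, b3 = x^3)

# piecewise constant basis: b_j(x) = I(c_j <= x < c_{j+1})
cuts   <- cut(x, breaks = c(0, 2.5, 5, 7.5, 10), include.lowest = TRUE)
B_step <- model.matrix(~ cuts - 1)  # one indicator column per bin

# either design matrix plugs into ordinary least squares
fit_poly <- lm(y ~ B_poly)
fit_step <- lm(y ~ B_step - 1)  # the indicators already span the intercept
```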

Basis functions are advantageous because if we fit a model like

\[ y_i = \beta_0 + \beta_1b_1(x_i) + \beta_2b_2(x_i) + \cdots + \beta_Kb_K(x_i) + \epsilon_i \]

then we can use least squares estimation and we have access to all of the tools available to linear regression like standard errors and statistical tests. Next, we will explore a few more basis functions.

13.7 Regression Splines

Regression Splines

Regression splines are piecewise polynomial basis functions with a constraint that the fitted curve must be continuous and have continuous first and second derivatives.

Regression splines have a degree, which determines the order of the polynomials (e.g. 1 is linear and 3 is cubic). Regression splines also have knots, which are the break points between regions.

Regression splines combine piecewise step functions and polynomial regression with certain constraints using basis functions. Regression splines are called nonparametric even though they result in estimated regression coefficients.

Knots are where the regions meet. Regression splines are continuous at the knots and have continuous first and second derivatives there, which ensures the fitted line is smooth.

We won’t show the basis functions for simplicity. Just recall that basis functions transform the variable \(x\) before fitting the regression model.
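Outside of recipes, base R's splines package generates a comparable cubic B-spline basis with splines::bs(); a minimal sketch on data simulated like example 1 (our own objects, not the book's):

```r
library(splines)

set.seed(3)
x <- runif(200, 0, 10)
y <- 10 * sin(x) + x + 20 + rnorm(200, sd = 2)

# cubic B-spline basis with interior knots at the quartiles of x;
# the fitted curve is continuous with continuous first and second derivatives
fit <- lm(y ~ bs(x, degree = 3, knots = quantile(x, c(0.25, 0.5, 0.75))))
length(coef(fit))  # intercept + (degree + number of knots) = 7 coefficients
```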

Figure 13.3 shows several piecewise polynomial regression models fit to the simulated example 1 data. The first panel shows a piecewise linear model. Note the discontinuity. The second panel shows a piecewise square model. Again, note the discontinuity. The bottom two panels show linear and square models with added continuity constraints.

Code
# create a simple linear regression model
lin_reg_model <- linear_reg() |>
  set_engine(engine = "lm") |>
  set_mode(mode = "regression") 

# create a workflow with a model but no recipe
base_workflow <- workflow() |>
  add_model(spec = lin_reg_model)

# create a series of custom recipes
piecewise_linear_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_mutate(ntile = factor(ntile(x, n = 4))) |>
  step_dummy(ntile) |>
  step_interact(terms = ~ x:starts_with("ntile"))

piecewise_square_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_mutate(ntile = factor(ntile(x, n = 4))) |>
  step_dummy(ntile) |>
  step_poly(x, options = list(raw = FALSE)) |>
  step_interact(terms = ~ starts_with("x"):starts_with("n"))

continuous_linear_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_spline_b(all_predictors(), degree = 1, deg_free = 5)

continuous_square_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_spline_b(all_predictors(), degree = 2, deg_free = 6)

continuous_cubic_rec <- recipe(formula = y ~ x, data = data1_train) |>
  step_spline_b(all_predictors(), degree = 3, deg_free = 7)

# add the custom recipes and fit the models
piecewise_linear_mod <- base_workflow |>
  add_recipe(recipe = piecewise_linear_rec) |>
  fit(data = data1_train) 

piecewise_square_mod <- base_workflow |>
  add_recipe(recipe = piecewise_square_rec) |>
  fit(data = data1_train) 

continuous_linear_mod <- base_workflow |>
  add_recipe(recipe = continuous_linear_rec) |>
  fit(data = data1_train)

continuous_square_mod <- base_workflow |>
  add_recipe(recipe = continuous_square_rec) |>
  fit(data = data1_train)

continuous_cubic_mod <- base_workflow |>
  add_recipe(recipe = continuous_cubic_rec) |>
  fit(data = data1_train)

# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))

predictions_grid <- tibble(
  x = seq(0, 10, 0.1),
  `Piecewise, Linear` = predict(object = piecewise_linear_mod, new_data = new_data)$.pred,
  `Piecewise, Square` = predict(object = piecewise_square_mod, new_data = new_data)$.pred,
  `Continuous, Linear` = predict(object = continuous_linear_mod, new_data = new_data)$.pred,
  `Continuous, Square` = predict(object = continuous_square_mod, new_data = new_data)$.pred,
  `Continuous, Cubic` = predict(object = continuous_cubic_mod, new_data = new_data)$.pred
) |>
  pivot_longer(-x, names_to = "model", values_to = ".pred") |>
  mutate(model = factor(model, levels = c("Piecewise, Linear", "Piecewise, Square", "Continuous, Linear", "Continuous, Square", "Continuous, Cubic")))

ggplot() +
  geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
  geom_line(
    data = filter(predictions_grid, model != "Continuous, Cubic"),
    mapping = aes(x = x, y = .pred), 
    color = "red"
  ) +
  facet_wrap(~model) +
  labs(
    title = "Example 1: Data with Predictions (Piecewise Polynomials)",
    subtitle = "Prediction in red"
  ) +
  theme_minimal()

Figure 13.3: Piecewise linear and square models, with and without continuity constraints, fit to the example 1 training data.

Finally, Figure 13.4 shows a cubic spline model fit to the data. The model uses cubic polynomials and four knots. Recall that it took a 4th-degree polynomial to fit the pattern in the data.

Code
ggplot() +
  geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
  geom_line(
    data = filter(predictions_grid, model == "Continuous, Cubic"),
    mapping = aes(x = x, y = .pred), 
    color = "red"
  ) +
  facet_wrap(~model) +
  labs(
    title = "Example 1: Data with Predictions (Cubic Regression splines)",
    subtitle = "Prediction in red"
  ) +
  theme_minimal()

Figure 13.4: Regression splines with cubic polynomials and four knots. Recall that it took a 4th-degree polynomial to fit the pattern in the data.

Higher degrees and more knots will fit the training data better but run the risk of resulting in an overfit model.

Most data scientists don’t use degree > 3 because it is said that the human eye can’t detect discontinuities in third- and higher-order derivatives.

This leaves picking the number and location of the knots. Knots are typically placed at evenly spaced quantiles of the data (e.g. three knots will be placed at the 25th, 50th, and 75th percentiles). The number of knots is limited by the number of observations in the data set. One popular approach is to propose many knots and then use regularization methods like LASSO regression with cross validation to drop unnecessary knots.

Basis splines

Basis splines are regression splines where every region uses a polynomial of the same degree.

Natural splines

Natural splines are basis splines with the added constraint that the minimum region and maximum region (the area outside the most extreme knots) are constrained to be linear.

High-degree polynomials have high variance in sparse regions and at the extremes of the data. This means minor changes in the data can dramatically change the fitted line.1
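The linear tails are easy to see numerically with base R's splines::ns(). In this sketch (on simulated data of our own) we deliberately predict beyond the observed range just to expose the boundary behavior: beyond the boundary knots a natural spline is a straight line, so the second differences of its predictions are numerically zero.

```r
library(splines)

set.seed(4)
x <- runif(300, 0, 10)
y <- 10 * sin(x) + x + 20 + rnorm(300, sd = 2)

# ns() is constrained to be linear beyond its boundary knots
fit_ns <- lm(y ~ ns(x, df = 6))

# predictions past the largest knot fall on a straight line
grid <- data.frame(x = seq(10.5, 12, 0.1))
p_ns <- predict(fit_ns, newdata = grid)
max(abs(diff(diff(p_ns))))  # ~0: the tail is linear
```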

13.8 Generalized Additive Models

Additive assumption

The additive assumption of linear regression states that the contribution of each predictor to the outcome variable is independent of the other predictors.

Interactions are one approach to address violations of the additivity assumption.

Until now, we have only considered basis functions in simple regression applications with exactly one predictor. Generalized additive models (GAMs) allow us to use non-linear functions of multiple predictors while maintaining additivity.

Generalized Additive Models

Generalized additive models are of the form

\[ y_i = \beta_0 + \sum_{j = 1}^pf_j(x_{ij}) + \epsilon_i \]

In other words, the fitted line (the conditional mean) is the sum of the contributions of each of the predictors. Each contribution can be modeled as a non-linear function (a basis function) of its predictor (polynomials, splines, etc.).
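A hedged sketch using mgcv, which ships with R and is a standard GAM engine (tidymodels also exposes it, via gen_additive_mod() with the "mgcv" engine). The data here are simulated in the style of example 1 but are our own objects:

```r
library(mgcv)

set.seed(5)
n  <- 500
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 10 * sin(x1) + x2 + rnorm(n, sd = 2)

# each s() term contributes a smooth f_j(x_j); the model stays additive
fit <- gam(y ~ s(x1) + s(x2))
summary(fit)$r.sq  # adjusted R-squared of the additive fit
```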

13.9 Multivariate Adaptive Regression Splines (MARS)

Multivariate Adaptive Regression Splines (MARS) are an alternative to GAMs for capturing non-linear patterns in data using linear regression and splines.

Hinge Function

\[ h(x) = \begin{cases} x & x > 0 \\ 0 & x \leq 0 \end{cases} \]

or, for a knot at \(c\), the mirrored pair

\[ h(x - c) = \max(0, x - c) \quad \text{and} \quad h(c - x) = \max(0, c - x) \]

Multivariate Adaptive Regression Splines (MARS)

MARS is an additive model of the form

\[ y_i = \sum_{j= 1}^pc_jB_j(x_i) + \epsilon_i \]

where \(c_j\) is a linear regression coefficient and \(B_j(x_i)\) is

  1. A constant (the value 1)
  2. A hinge function
  3. A product of hinge functions

MARS proceeds in three steps

  1. Find the knot that creates a piecewise regression model of the following form with the lowest SSE.

\[ y = \beta_0 + \beta_1h(x) \]

  2. Repeat this process recursively until some stopping criterion is reached. For example, the second step will have the form:

\[ y = \beta_0 + \beta_1h_1(x) + \beta_2h_2(x) \]

  3. Prune with cross-validation.
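The first step above can be sketched directly in base R: scan a grid of candidate knots, fit the mirrored-hinge model at each, and keep the knot with the lowest SSE. This is a toy forward pass on simulated data of our own, not the earth implementation:

```r
set.seed(6)
x <- runif(500, 0, 10)
y <- 10 * sin(x) + x + 20 + rnorm(500, sd = 2)

hinge <- function(z) pmax(0, z)

# SSE of the one-knot model y ~ h(x - c) + h(c - x)
sse_at <- function(c) {
  fit <- lm(y ~ hinge(x - c) + hinge(c - x))
  sum(resid(fit)^2)
}

# scan candidate knots at evenly spaced quantiles and keep the best
candidates <- quantile(x, probs = seq(0.05, 0.95, by = 0.05))
sse <- vapply(candidates, sse_at, numeric(1))
best_knot <- candidates[[which.min(sse)]]
```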

Figure 13.5 demonstrates MARS with 1, 2, 3, and 4 knots using the simulated example 1 data.

Code
# fit MARS with an increasing number of knots
mars1 <- mars(num_terms = 2) |>
  set_engine(engine = "earth") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ x, data = data1_train)

mars2 <- mars(num_terms = 3) |>
  set_engine(engine = "earth") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ x, data = data1_train)

mars3 <- mars(num_terms = 4) |>
  set_engine(engine = "earth") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ x, data = data1_train)

mars4 <- mars(num_terms = 5) |>
  set_engine(engine = "earth") |>
  set_mode(mode = "regression") |>
  fit(formula = y ~ x, data = data1_train)

# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))

predictions_grid <- tibble(
  x = seq(0, 10, 0.1),
  `Knots = 1` = predict(object = mars1, new_data = new_data)$.pred,
  `Knots = 2` = predict(object = mars2, new_data = new_data)$.pred,
  `Knots = 3` = predict(object = mars3, new_data = new_data)$.pred,
  `Knots = 4` = predict(object = mars4, new_data = new_data)$.pred
) |>
  pivot_longer(-x, names_to = "model", values_to = ".pred")

ggplot() +
  geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
  geom_path(data = predictions_grid, aes(x = x, y = .pred), color = "red") +
  facet_wrap(~model) +
  labs(
    title = "Example 1: Data with Predictions (MARS)",
    subtitle = "Prediction in red"
  ) +
  theme_minimal()

Figure 13.5: MARS models with 1 through 4 knots fit to the example 1 training data.

We can print the table of hinges and coefficients for the four models like this:

mars1$fit$coefficients
                    y
(Intercept)  22.82586
h(x-4.50551)  2.25023
mars2$fit$coefficients
                     y
(Intercept)  28.152720
h(x-4.50551)  9.135052
h(x-2.3873)  -5.294739
mars3$fit$coefficients
                      y
(Intercept)   28.813593
h(x-4.50551)  17.161781
h(x-2.3873)   -8.075389
h(x-7.37819) -12.883468
mars4$fit$coefficients
                      y
(Intercept)   37.598528
h(x-4.50551)  19.679189
h(4.50551-x)  -2.964782
h(x-8.12048) -16.167899
h(x-2.3873)  -11.946754

13.10 Case Study

We now return to the PDB example with 25 variables and five-fold cross validation.

13.10.1 Linear Regression

We first consider multiple linear regression. We need a recipe, a model, and a workflow to fit the model five times and evaluate its predictive accuracy.

The data set contains a few variables that shouldn’t be included in the model. We use update_role() to turn them into “ids” to remove them from consideration.

lm1_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id")

lm1_mod <- linear_reg() |>
  set_mode(mode = "regression") |>
  set_engine(engine = "lm")

lm1_wf <- workflow() |>
  add_recipe(lm1_rec) |>
  add_model(lm1_mod)

lm1_resamples <- lm1_wf |>
  fit_resamples(resamples = pdb_folds)

lm1_resamples |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   5.33      5 0.0580  Preprocessor1_Model1
2 rsq     standard   0.485     5 0.00906 Preprocessor1_Model1

We can use last_fit() to fit the model to all of the training data and extract_fit_parsnip() to examine the estimated coefficients.

lm1_wf |>
  last_fit(pdb_split) |>
  extract_fit_parsnip()
parsnip model object


Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
                (Intercept)     Renter_Occp_HU_ACS_13_17  
                  2.585e+01                    5.276e-04  
        Pop_18_24_ACS_13_17       Female_No_HB_ACS_13_17  
                  1.416e-03                    7.387e-04  
   NH_White_alone_ACS_13_17         Pop_65plus_ACS_13_17  
                 -9.401e-04                   -1.968e-03  
Rel_Child_Under_6_ACS_13_17              Males_ACS_13_17  
                  1.898e-03                    8.849e-04  
 MrdCple_Fmly_HHD_ACS_13_17          Pop_25_44_ACS_13_17  
                 -2.668e-03                    1.217e-03  
 Tot_Vacant_Units_ACS_13_17            College_ACS_13_17  
                  2.090e-03                   -3.449e-04  
      Med_HHD_Inc_ACS_13_17          Pop_45_64_ACS_13_17  
                 -9.768e-05                    1.730e-03  
     HHD_Moved_in_ACS_13_17           Hispanic_ACS_13_17  
                  3.872e-03                   -3.717e-04  
      Single_Unit_ACS_13_17    Diff_HU_1yr_Ago_ACS_13_17  
                 -2.633e-03                   -8.486e-04  
         Pop_5_17_ACS_13_17       NH_Blk_alone_ACS_13_17  
                  1.696e-03                    5.467e-04  
    Sngl_Prns_HHD_ACS_13_17        Not_HS_Grad_ACS_13_17  
                 -4.323e-03                   -2.103e-04  
  Med_House_Value_ACS_13_17  
                  9.283e-06  

13.10.2 Linear Regression with Transformations

Erdman and Bates (2017) included square root and logit transformations in their final specification. We next update the recipe to include their functional form.

# create a complex recipe with square root, log, and logit transformations
lm2_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
  step_sqrt(
    Pop_18_24_ACS_13_17,
    Female_No_HB_ACS_13_17,
    Pop_65plus_ACS_13_17,
    Rel_Child_Under_6_ACS_13_17,
    Pop_25_44_ACS_13_17,
    Pop_45_64_ACS_13_17,
    HHD_Moved_in_ACS_13_17,
    Diff_HU_1yr_Ago_ACS_13_17,
    Pop_5_17_ACS_13_17,
    Sngl_Prns_HHD_ACS_13_17,
    Not_HS_Grad_ACS_13_17
  ) |>
  step_log(
    Med_HHD_Inc_ACS_13_17,
    Med_House_Value_ACS_13_17
  ) |>
  step_log(
    Tot_Vacant_Units_ACS_13_17,
    offset = 0.1
  ) |>
  add_role(
    Renter_Occp_HU_ACS_13_17,
    NH_White_alone_ACS_13_17,
    College_ACS_13_17,
    Hispanic_ACS_13_17,
    Single_Unit_ACS_13_17,
    NH_Blk_alone_ACS_13_17,
    new_role = "logit"
  ) |>
  step_range(
    has_role("logit")
  ) |>
  step_logit(
    has_role("logit"),
    offset = 0.01
  )

# create a linear regression model
lm2_mod <- linear_reg() |>
  set_mode(mode = "regression") |>
  set_engine(engine = "lm")

# combine the recipe and model into a workflow
lm2_wf <- workflow() |>
  add_recipe(lm2_rec) |>
  add_model(lm2_mod)

# fit the model
lm2_resamples <- lm2_wf |>
  fit_resamples(resamples = pdb_folds)

Changing the functional form improves the model predictions.

lm2_resamples |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.96      5 0.0484  Preprocessor1_Model1
2 rsq     standard   0.552     5 0.00642 Preprocessor1_Model1

We can use last_fit() to fit the model to all of the training data and extract_fit_parsnip() to examine the estimated coefficients.

lm2_wf |>
  last_fit(pdb_split) |>
  extract_fit_parsnip()
parsnip model object


Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
                (Intercept)     Renter_Occp_HU_ACS_13_17  
                  2.770e+01                    1.544e+00  
        Pop_18_24_ACS_13_17       Female_No_HB_ACS_13_17  
                  6.398e-02                    1.353e-01  
   NH_White_alone_ACS_13_17         Pop_65plus_ACS_13_17  
                 -6.925e-01                   -2.522e-01  
Rel_Child_Under_6_ACS_13_17              Males_ACS_13_17  
                 -9.002e-03                    2.715e-04  
 MrdCple_Fmly_HHD_ACS_13_17          Pop_25_44_ACS_13_17  
                  6.263e-05                    7.914e-02  
 Tot_Vacant_Units_ACS_13_17            College_ACS_13_17  
                  6.553e-01                   -1.097e+00  
      Med_HHD_Inc_ACS_13_17          Pop_45_64_ACS_13_17  
                 -2.897e+00                    5.043e-03  
     HHD_Moved_in_ACS_13_17           Hispanic_ACS_13_17  
                  3.193e-03                    5.260e-01  
      Single_Unit_ACS_13_17    Diff_HU_1yr_Ago_ACS_13_17  
                 -1.570e+00                   -2.573e-02  
         Pop_5_17_ACS_13_17       NH_Blk_alone_ACS_13_17  
                 -2.098e-03                    4.699e-01  
    Sngl_Prns_HHD_ACS_13_17        Not_HS_Grad_ACS_13_17  
                 -7.650e-02                    6.985e-03  
  Med_House_Value_ACS_13_17  
                  1.772e+00  

13.10.3 Elastic Net

Let’s try the Erdman and Bates (2017) functional form with elastic net regularization.

enet_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
  step_sqrt(
    Pop_18_24_ACS_13_17,
    Female_No_HB_ACS_13_17,
    Pop_65plus_ACS_13_17,
    Rel_Child_Under_6_ACS_13_17,
    Pop_25_44_ACS_13_17,
    Pop_45_64_ACS_13_17,
    HHD_Moved_in_ACS_13_17,
    Diff_HU_1yr_Ago_ACS_13_17,
    Pop_5_17_ACS_13_17,
    Sngl_Prns_HHD_ACS_13_17,
    Not_HS_Grad_ACS_13_17
  ) |>
  step_log(
    Med_HHD_Inc_ACS_13_17,
    Med_House_Value_ACS_13_17
  ) |>
  step_log(
    Tot_Vacant_Units_ACS_13_17,
    offset = 0.1
  ) |>
  add_role(
    Renter_Occp_HU_ACS_13_17,
    NH_White_alone_ACS_13_17,
    College_ACS_13_17,
    Hispanic_ACS_13_17,
    Single_Unit_ACS_13_17,
    NH_Blk_alone_ACS_13_17,
    new_role = "logit"
  ) |>
  step_range(
    has_role("logit")
  ) |>
  step_logit(
    has_role("logit"),
    offset = 0.01
  ) |>
  step_normalize(all_predictors())

enet_mod <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_mode(mode = "regression") |>
  set_engine(engine = "glmnet")  

enet_wf <- workflow() |>
  add_recipe(recipe = enet_rec) |>
  add_model(spec = enet_mod)
  
enet_grid <- grid_regular(
  penalty(), 
  mixture(),
  levels = 20
)

enet_tuning <- tune_grid(
  enet_wf,
  resamples = pdb_folds,
  grid = enet_grid
)

show_best(enet_tuning)
Warning: No value of `metric` was given; metric 'rmse' will be used.
# A tibble: 5 × 8
   penalty mixture .metric .estimator  mean     n std_err .config               
     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
1 7.85e- 3   0.737 rmse    standard    4.96     5  0.0483 Preprocessor1_Model296
2 7.85e- 3   0.579 rmse    standard    4.96     5  0.0483 Preprocessor1_Model236
3 1   e-10   1     rmse    standard    4.96     5  0.0483 Preprocessor1_Model381
4 3.36e-10   1     rmse    standard    4.96     5  0.0483 Preprocessor1_Model382
5 1.13e- 9   1     rmse    standard    4.96     5  0.0483 Preprocessor1_Model383
autoplot(enet_tuning)

enet_tuned_wf <-
  enet_wf |>
  tune::finalize_workflow(tune::select_best(x = enet_tuning, metric = "rmse"))

enet_resamples <- enet_tuned_wf |>
  fit_resamples(resamples = pdb_folds)

Regularization doesn’t help much in this case. This makes sense because the regression specification is parsimonious and deliberate.

enet_resamples |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.96      5 0.0483  Preprocessor1_Model1
2 rsq     standard   0.552     5 0.00642 Preprocessor1_Model1

13.10.4 Polynomial Regression

Next we add a 2nd-degree polynomial for all predictors.

polynomial_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
  step_poly(all_predictors())

lm_mod <- linear_reg() |>
  set_mode(mode = "regression") |>
  set_engine(engine = "lm")

polynomial_wf <- workflow() |>
  add_recipe(polynomial_rec) |>
  add_model(lm_mod)

polynomial_resamples <- polynomial_wf |>
  fit_resamples(resamples = pdb_folds)

The results are terrible. It’s possible the polynomials are poorly behaved and the model has issues with multicollinearity. In particular, look at the std_err of the mean RMSE. The inconsistency of the RMSE across folds is troubling.

polynomial_resamples |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   6.22      5  1.15   Preprocessor1_Model1
2 rsq     standard   0.450     5  0.0823 Preprocessor1_Model1

13.10.5 Polynomial Regression with Elastic Net

Let’s try adding 3rd-degree polynomials and interactions to create a huge design matrix and then use elastic net regression for regularization.

poly_enet_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id") |>
  step_sqrt(
    Pop_18_24_ACS_13_17,
    Female_No_HB_ACS_13_17,
    Pop_65plus_ACS_13_17,
    Rel_Child_Under_6_ACS_13_17,
    Pop_25_44_ACS_13_17,
    Pop_45_64_ACS_13_17,
    HHD_Moved_in_ACS_13_17,
    Diff_HU_1yr_Ago_ACS_13_17,
    Pop_5_17_ACS_13_17,
    Sngl_Prns_HHD_ACS_13_17,
    Not_HS_Grad_ACS_13_17
  ) |>
  step_log(
    Med_HHD_Inc_ACS_13_17,
    Med_House_Value_ACS_13_17
  ) |>
  step_log(
    Tot_Vacant_Units_ACS_13_17,
    offset = 0.1
  ) |>
  add_role(
    Renter_Occp_HU_ACS_13_17,
    NH_White_alone_ACS_13_17,
    College_ACS_13_17,
    Hispanic_ACS_13_17,
    Single_Unit_ACS_13_17,
    NH_Blk_alone_ACS_13_17,
    new_role = "logit"
  ) |>
  step_range(
    has_role("logit")
  ) |>
  step_logit(
    has_role("logit"),
    offset = 0.01
  ) |>
  step_normalize(all_predictors()) |>
  step_poly(all_predictors(), degree = 3) |>
  step_interact(terms = ~starts_with("Renter"):all_predictors())
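The logit steps above assume values in \([0, 1]\), which is why `step_range()` rescales the "logit" variables first; the `offset` keeps values at the boundaries finite. A base R sketch of the idea, assuming the offset simply nudges boundary values inward (recipes’ exact handling of the offset may differ):

```r
# sketch of a logit transform with an offset: values of exactly 0 or 1 would
# map to -Inf / Inf under log(p / (1 - p)), so nudge them inside (0, 1) first
logit_with_offset <- function(p, offset = 0.01) {
  p <- pmin(pmax(p, offset), 1 - offset)
  log(p / (1 - p))
}

logit_with_offset(c(0, 0.5, 1))  # finite at the boundaries, 0 at p = 0.5
```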

poly_enet_mod <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_mode(mode = "regression") |>
  set_engine(engine = "glmnet")  

poly_enet_wf <- workflow() |>
  add_recipe(recipe = poly_enet_rec) |>
  add_model(spec = poly_enet_mod)
  
poly_enet_grid <- grid_regular(
  penalty(), 
  mixture(),
  levels = 20
)

poly_enet_tuning <- tune_grid(
  poly_enet_wf,
  resamples = pdb_folds,
  grid = poly_enet_grid
)

show_best(poly_enet_tuning)
Warning: No value of `metric` was given; metric 'rmse' will be used.
# A tibble: 5 × 8
  penalty mixture .metric .estimator  mean     n std_err .config               
    <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
1  0.0264   0.632 rmse    standard    4.73     5  0.0495 Preprocessor1_Model257
2  0.0264   0.579 rmse    standard    4.73     5  0.0505 Preprocessor1_Model237
3  0.0264   0.684 rmse    standard    4.73     5  0.0484 Preprocessor1_Model277
4  0.0264   0.789 rmse    standard    4.73     5  0.0474 Preprocessor1_Model317
5  0.0264   0.737 rmse    standard    4.73     5  0.0481 Preprocessor1_Model297
autoplot(poly_enet_tuning)

poly_enet_tuned_wf <-
  poly_enet_wf |>
  tune::finalize_workflow(tune::select_best(x = poly_enet_tuning, metric = "rmse"))

poly_enet_resamples <- poly_enet_tuned_wf |>
  fit_resamples(resamples = pdb_folds)

Here, elastic net regularization dramatically improves model performance over both the simple model and the saturated model with regularization.

poly_enet_resamples |>
  collect_metrics()
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.73      5 0.0495  Preprocessor1_Model1
2 rsq     standard   0.593     5 0.00639 Preprocessor1_Model1

13.10.6 GAM/Regression Splines

Let’s explore regression splines with a GAM. The syntax is slightly different because mgcv estimates the smoothness of each s() term during fitting rather than treating it as a tuned hyperparameter; the adjust_deg_free argument only nudges that internal selection.

Also, the model takes a long time to fit. To keep things simple, we only consider a few splines. Choosing the correct predictors for splines requires EDA and theory.
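Each s() term in the formula below asks mgcv to estimate a smooth function of that predictor. Conceptually this is similar to base R’s `stats::smooth.spline()`, sketched here on toy data (the sine example is ours, not from the PDB):

```r
# a smoother recovers a nonlinear signal that a straight line cannot
set.seed(1)
x <- seq(0, 2 * pi, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.2)

fit  <- smooth.spline(x, y)
pred <- predict(fit, x)$y

mean((pred - sin(x))^2)  # small error against the true sine curve
```

A linear fit of `y` on `x` would leave essentially all of the sine shape in the residuals, which is the kind of structure the s() terms are meant to absorb.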

gam_spec <- gen_additive_mod(adjust_deg_free = tune()) |>
  set_mode(mode = "regression") |>
  set_engine(engine = "mgcv")

gam_wf <- workflow() |>
  add_recipe(lm1_rec) |>
  add_model(
    gam_spec,
    formula = non_return_rate ~ 
      s(Renter_Occp_HU_ACS_13_17) +
      s(Pop_18_24_ACS_13_17) +
      s(Female_No_HB_ACS_13_17) +
      s(NH_White_alone_ACS_13_17) +
      s(Pop_65plus_ACS_13_17) +
      s(Rel_Child_Under_6_ACS_13_17) +
      s(Males_ACS_13_17) +
      s(MrdCple_Fmly_HHD_ACS_13_17) +
      s(Pop_25_44_ACS_13_17) +
      s(Tot_Vacant_Units_ACS_13_17) +
      s(College_ACS_13_17) +
      s(Med_HHD_Inc_ACS_13_17) +
      s(Pop_45_64_ACS_13_17) +
      s(HHD_Moved_in_ACS_13_17) +
      Hispanic_ACS_13_17 +
      Single_Unit_ACS_13_17 +
      Diff_HU_1yr_Ago_ACS_13_17 +
      Pop_5_17_ACS_13_17 +
      NH_Blk_alone_ACS_13_17 +
      Sngl_Prns_HHD_ACS_13_17 +
      Not_HS_Grad_ACS_13_17 +
      Med_House_Value_ACS_13_17
  )

gam_grid <-
  grid_regular(adjust_deg_free(range = c(0.25, 5)), levels = 5)

gam_tune <-
  tune_grid(
    gam_wf, 
    resamples = pdb_folds,
    grid = gam_grid
  )

autoplot(gam_tune)

show_best(gam_tune)
Warning: No value of `metric` was given; metric 'rmse' will be used.
# A tibble: 5 × 7
  adjust_deg_free .metric .estimator  mean     n std_err .config             
            <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1            2.62 rmse    standard    4.87     5  0.0453 Preprocessor1_Model3
2            3.81 rmse    standard    4.87     5  0.0451 Preprocessor1_Model4
3            1.44 rmse    standard    4.87     5  0.0460 Preprocessor1_Model2
4            5    rmse    standard    4.87     5  0.0451 Preprocessor1_Model5
5            0.25 rmse    standard    4.88     5  0.0484 Preprocessor1_Model1
gam_final_wf <-
  finalize_workflow(gam_wf,
                    select_best(gam_tune, metric = "rmse"))

gam_resamples <- gam_final_wf |>
  fit_resamples(resamples = pdb_folds)

show_best(gam_resamples)
Warning: No value of `metric` was given; metric 'rmse' will be used.
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard    4.87     5  0.0453 Preprocessor1_Model1

13.10.7 MARS

Finally, let’s explore MARS with hyperparameter tuning. Note the simplicity of the recipe.
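MARS builds its fit from piecewise-linear hinge basis functions, \(h(x) = \max(0, x - c)\) and \(\max(0, c - x)\), and, with `prod_degree = 2`, products of pairs of hinges. A minimal base R sketch of one hinge:

```r
# a hinge function: zero below the knot, linear above it
hinge <- function(x, knot) pmax(0, x - knot)

hinge(c(1, 3, 5), knot = 3)  # returns 0 0 2
```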

mars_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id")

# we will tune the number of model terms retained and the maximum degree of
# interaction between hinge functions
mars_mod <- mars(
  num_terms = tune(), 
  prod_degree = tune()
) |>
  set_mode(mode = "regression") |>
  set_engine(engine = "earth")

mars_wf <- workflow() |>
  add_recipe(recipe = mars_rec) |>
  add_model(spec = mars_mod)

mars_grid <- grid_regular(
  num_terms(range = c(20, 100)),
  prod_degree(),
  levels = 10
)

mars_resamples <- tune_grid(
  mars_wf,
  resamples = pdb_folds,
  grid = mars_grid
)

collect_metrics(mars_resamples)

show_best(mars_resamples)

autoplot(mars_resamples)
mars_rec <- recipe(non_return_rate ~ ., pdb_train) |>
  update_role(State_name, County_name, Low_Response_Score, new_role = "id")

# the best parameters were chosen using hyperparameter tuning
mars_mod <- mars(
  num_terms = 46,
  prod_degree = 2
) |>
  set_mode(mode = "regression") |>
  set_engine(engine = "earth")

mars_wf <- workflow() |>
  add_recipe(recipe = mars_rec) |>
  add_model(spec = mars_mod)

mars_resamples <- mars_wf |>
  fit_resamples(resamples = pdb_folds)

collect_metrics(mars_resamples)
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   4.75      5 0.0560  Preprocessor1_Model1
2 rsq     standard   0.589     5 0.00768 Preprocessor1_Model1

13.10.8 Final Comparison

Finally, let’s compare the RMSE and \(r\)-squared across all six models, from simple linear regression to MARS (with hyperparameter tuning).

bind_rows(
  `Linear regression` = collect_metrics(lm1_resamples) |>
    filter(.metric == "rmse"),
  `Linear regression (Prepped)` = collect_metrics(lm2_resamples) |>
    filter(.metric == "rmse"),  
  `Elastic Net` = collect_metrics(enet_resamples) |>
    filter(.metric == "rmse"),  
  `Polynomial ENet regression` = collect_metrics(poly_enet_resamples) |>
    filter(.metric == "rmse"),
  `GAM` = collect_metrics(gam_resamples) |>
    filter(.metric == "rmse"),  
  `MARS` = collect_metrics(mars_resamples) |>
    filter(.metric == "rmse"),
  .id = "model"
)
# A tibble: 6 × 7
  model                       .metric .estimator  mean     n std_err .config    
  <chr>                       <chr>   <chr>      <dbl> <int>   <dbl> <chr>      
1 Linear regression           rmse    standard    5.33     5  0.0580 Preprocess…
2 Linear regression (Prepped) rmse    standard    4.96     5  0.0484 Preprocess…
3 Elastic Net                 rmse    standard    4.96     5  0.0483 Preprocess…
4 Polynomial ENet regression  rmse    standard    4.73     5  0.0495 Preprocess…
5 GAM                         rmse    standard    4.87     5  0.0453 Preprocess…
6 MARS                        rmse    standard    4.75     5  0.0560 Preprocess…
bind_rows(
  `Linear regression` = collect_metrics(lm1_resamples) |>
    filter(.metric == "rsq"),
  `Linear regression (Prepped)` = collect_metrics(lm2_resamples) |>
    filter(.metric == "rsq"),    
  `Elastic Net` = collect_metrics(enet_resamples) |>
    filter(.metric == "rsq"),   
  `Polynomial ENet regression` = collect_metrics(poly_enet_resamples) |>
    filter(.metric == "rsq"), 
  `GAM` = collect_metrics(gam_resamples) |>
    filter(.metric == "rsq"),  
  `MARS` = collect_metrics(mars_resamples) |>
    filter(.metric == "rsq"),
  .id = "model"
)
# A tibble: 6 × 7
  model                       .metric .estimator  mean     n std_err .config    
  <chr>                       <chr>   <chr>      <dbl> <int>   <dbl> <chr>      
1 Linear regression           rsq     standard   0.485     5 0.00906 Preprocess…
2 Linear regression (Prepped) rsq     standard   0.552     5 0.00642 Preprocess…
3 Elastic Net                 rsq     standard   0.552     5 0.00642 Preprocess…
4 Polynomial ENet regression  rsq     standard   0.593     5 0.00639 Preprocess…
5 GAM                         rsq     standard   0.569     5 0.00569 Preprocess…
6 MARS                        rsq     standard   0.589     5 0.00768 Preprocess…

  1. The Council of Economic Advisers’ “cubic model” of Covid-19 cases is one high-profile example of extrapolation with cubic models gone awry.↩︎